DiffVar: A new method for detecting differential variability with 
application to methylation in cancer and aging 
Supplementary Text 

Belinda Phipson Alicia Oshlack 



1 Calculating absolute or squared residuals 

Let $y_{ijk}$ denote the M-value for the $i$th sample, $i = 1, \ldots, n_k$, the $j$th CpG site,
$j = 1, \ldots, 482\,421$, and the $k$th group, $k = 1, \ldots, K$. Here $n_k$ is the sample size for the
$k$th group and $K$ is the total number of groups. For a two group comparison, e.g. cancer vs normal,
$K = 2$. Note that each CpG site is analysed independently of the other CpGs.

The first step in the method is to calculate absolute or squared residuals for each observation in
each of the $K$ groups for each CpG. For the $j$th CpG site, the mean M-value for the $k$th group is
calculated by

$$\bar{y}_{jk} = \frac{1}{n_k} \sum_{i=1}^{n_k} y_{ijk}.$$

Absolute deviations, or residuals, are calculated as

$$z_{ijk} = \sqrt{\frac{n_k}{n_k - 1}} \, |y_{ijk} - \bar{y}_{jk}|,$$

where $\sqrt{n_k/(n_k - 1)}$ is a leverage factor which takes into account unequal sample sizes. This
ensures that groups with larger sample sizes are not biased towards detecting larger variances compared
to groups with smaller sample sizes. If squared deviations are required,

$$z_{ijk} = \frac{n_k}{n_k - 1} \, (y_{ijk} - \bar{y}_{jk})^2.$$
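
As an illustration of the residual calculation described in this section, the following Python sketch computes leverage-adjusted absolute or squared deviations for one group at one CpG site. The function name `levene_residuals` and its arguments are hypothetical and are not taken from any published implementation. The squared form assumes that the leverage factor enters as $n_k/(n_k - 1)$, i.e. the square of the factor used for absolute deviations.

```python
import math


def levene_residuals(values, squared=False):
    """Leverage-adjusted deviations of one group's M-values from the group mean.

    `values` holds the M-values of a single group at a single CpG site.
    Assumption: squared deviations are scaled by n / (n - 1), the square of
    the sqrt(n / (n - 1)) factor applied to absolute deviations.
    """
    n = len(values)
    if n < 2:
        raise ValueError("each group needs at least two samples")
    group_mean = sum(values) / n
    leverage = n / (n - 1)
    if squared:
        return [leverage * (y - group_mean) ** 2 for y in values]
    return [math.sqrt(leverage) * abs(y - group_mean) for y in values]


# Example with made-up M-values for one group.
print(levene_residuals([1.0, 2.0, 4.0, 7.0]))
print(levene_residuals([1.0, 2.0, 4.0, 7.0], squared=True))
```

Under this assumption, the mean of the squared residuals within a group equals the usual unbiased sample variance of that group.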


2 Calculating moderated t statistics

The $z_{ijk}$'s capture how much each sample deviates from the group mean. Hence groups that are more
variable will have larger $z_{ijk}$'s on average, and groups that are more consistent will have smaller
$z_{ijk}$'s on average. Let the true mean of the $z_{ijk}$ for group $k$ and CpG site $j$ be denoted
$\mu_{z_{jk}}$. Thus, for two groups, testing the null hypothesis $H_0: \mu_{z_{j1}} = \mu_{z_{j2}}$
effectively tests whether the two group variances are equal, or $H_0: \sigma^2_{j1} = \sigma^2_{j2}$.
Here $\sigma^2_{jk}$ is the unknown true variance of the $y_{ijk}$'s for group $k$ and CpG site $j$. For
the $j$th CpG site and the $k$th group, the mean absolute (or squared) deviation, $\bar{z}_{jk}$, is
given by

$$\bar{z}_{jk} = \frac{1}{n_k} \sum_{i=1}^{n_k} z_{ijk}.$$

In general, let $z_j = (z_{1j}, \ldots, z_{nj})$ be the vector of absolute or squared deviations for CpG
site $j$, where $n = n_1 + \ldots + n_K$ is the total number of samples in the experiment. We can fit a
linear model,

$$E(z_j) = X\beta_j, \qquad (1)$$

where $X$ is a design matrix of full column rank and $\beta_j$ is a vector of coefficients. For the two
group case, with no additional covariates, $\beta_j = (\beta_{j0}, \beta_{j1})$, where $\beta_{j0}$ is the
intercept and $\beta_{j1}$ is the regression coefficient. Here $\beta_{j1}$ is estimated as the difference
between the mean absolute or squared deviations of groups 1 and 2, that is, the difference between
$\bar{z}_{j1}$ and $\bar{z}_{j2}$.

More generally, in matrix terminology, a vector of regression coefficients can be estimated by

$$\hat{\beta}_j = (X^T X)^{-1} X^T z_j.$$
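
The least squares estimate can be sketched in plain Python for the two group case. This is a minimal illustration, not the authors' implementation: the helper names `two_group_design` and `ols_two_column` are hypothetical, and the design matrix coding (an intercept column plus an indicator for group 2) is an assumption. Under that coding the second coefficient equals the group 2 mean deviation minus the group 1 mean deviation; a different coding would flip the sign.

```python
def two_group_design(n1, n2):
    """Design matrix rows: intercept plus an indicator for group 2 (assumed coding)."""
    return [[1.0, 0.0] for _ in range(n1)] + [[1.0, 1.0] for _ in range(n2)]


def ols_two_column(X, z):
    """Return (beta, inv) with beta = (X^T X)^{-1} X^T z and inv = (X^T X)^{-1}.

    Only handles design matrices with exactly two columns, so the 2 x 2
    inverse can be written out explicitly.
    """
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    det = a * d - b * b
    if det == 0:
        raise ValueError("design matrix is not of full column rank")
    inv = [[d / det, -b / det], [-b / det, a / det]]
    xtz = [
        sum(row[0] * zi for row, zi in zip(X, z)),
        sum(row[1] * zi for row, zi in zip(X, z)),
    ]
    beta = [
        inv[0][0] * xtz[0] + inv[0][1] * xtz[1],
        inv[1][0] * xtz[0] + inv[1][1] * xtz[1],
    ]
    return beta, inv


# Made-up deviations: three samples in group 1, two samples in group 2.
z_group1 = [1.0, 2.0, 3.0]
z_group2 = [4.0, 6.0]
X = two_group_design(len(z_group1), len(z_group2))
beta, inv = ols_two_column(X, z_group1 + z_group2)
print(beta)       # intercept (group 1 mean) and the difference in group means
print(inv[1][1])  # diagonal element of (X^T X)^{-1} for the second coefficient
```

For this coding the diagonal element for the second coefficient works out to $1/n_1 + 1/n_2$.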

The variance of the absolute or squared deviations for the $j$th CpG site is denoted $s^2_{zj}$ and is
estimated from the residuals obtained from fitting the linear model in Eqn. 1.

The classic Levene's test (Levene, 1960) uses squared deviations and calculates ordinary t-statistics 
in the case of a two group comparison, or an ANOVA in the case of K > 2. For a 2 group comparison, 
the ordinary t test statistic for CpG site $j$ is

$$t_j = \frac{\hat{\beta}_{j1}}{s_{zj} \sqrt{v}},$$

where $v$ is the appropriate diagonal element from the positive definite matrix $(X^T X)^{-1}$. Two-sided
p-values can be computed from the t distribution with degrees of freedom equal to $d_j = n - p$, where
$p$ is the number of parameters estimated in the linear model.
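
For the two group case the ordinary t statistic can be sketched as follows. The function name `ordinary_t` is hypothetical, and the sign convention (group 2 minus group 1) is an assumption made for illustration. With no additional covariates, $p = 2$ and $v = 1/n_1 + 1/n_2$.

```python
import math


def ordinary_t(z1, z2):
    """Ordinary t statistic comparing mean deviations of two groups at one CpG.

    Returns (t, df, s2), where s2 is the residual variance of the two-group
    linear model and df = n - p with p = 2 estimated parameters.
    Assumed sign convention: group 2 mean minus group 1 mean.
    """
    n1, n2 = len(z1), len(z2)
    mean1 = sum(z1) / n1
    mean2 = sum(z2) / n2
    # Residual sum of squares around each group's own mean.
    rss = sum((z - mean1) ** 2 for z in z1) + sum((z - mean2) ** 2 for z in z2)
    df = n1 + n2 - 2
    s2 = rss / df
    v = 1.0 / n1 + 1.0 / n2
    t = (mean2 - mean1) / math.sqrt(s2 * v)
    return t, df, s2


t_stat, df, s2 = ordinary_t([1.0, 2.0, 3.0], [4.0, 6.0])
print(t_stat, df, s2)
```

A two-sided p-value would then be obtained from the t distribution with `df` degrees of freedom, for example with a statistics library that provides the t distribution's survival function.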

It is well established in the genomics field that performing an ordinary t-test results in many false
positives, particularly for studies with smaller sample sizes (Efron et al., 2001; Tusher, Tibshirani and
Chu, 2001; Lönnstedt and Speed, 2002; Broberg, 2003; Wright and Simon, 2003). Hence, once the absolute
or squared deviations have been obtained, moderated t-statistics (Smyth, 2004), which employ empirical
Bayes shrinkage of the $s^2_{zj}$, are calculated rather than ordinary t-statistics. For full
hierarchical model details and derivation of the moderated t-statistic please refer to Smyth (2004).
The moderated t-statistic is defined as

$$\tilde{t}_j = \frac{\hat{\beta}_{j1}}{\tilde{s}_{zj} \sqrt{v}},$$

where

$$\tilde{s}^2_{zj} = \frac{d_0 s^2_{z0} + d_j s^2_{zj}}{d_0 + d_j}$$

are the squeezed variances. The $d_0$ and $s^2_{z0}$ are hyperparameters of the hierarchical model that
can be estimated using empirical Bayes estimation procedures; see Smyth (2004) for full model details.
For differentially variable CpG sites, $\tilde{t}_j$ follows a scaled t distribution with degrees of
freedom $d_0 + d_j$. For CpG sites with no differences in variances, $\tilde{t}_j$ follows an unscaled t
distribution with degrees of freedom $d_0 + d_j$.
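
The variance squeezing step can be illustrated with a short Python sketch. It assumes the hyperparameters (the prior degrees of freedom and the prior variance) have already been estimated elsewhere, since their estimation is described in Smyth (2004) and is not reproduced here. The function and argument names are hypothetical.

```python
import math


def squeezed_variance(s2, df, s2_prior, df_prior):
    """Weighted average of the prior variance and the CpG-wise residual variance."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)


def moderated_t(beta1, s2, df, v, s2_prior, df_prior):
    """Moderated t statistic and its total degrees of freedom.

    beta1: estimated coefficient for the group difference
    s2, df: residual variance and residual degrees of freedom for this CpG
    v: diagonal element of (X^T X)^{-1} for the coefficient
    s2_prior, df_prior: hyperparameters, assumed to be supplied by the caller
    """
    s2_tilde = squeezed_variance(s2, df, s2_prior, df_prior)
    return beta1 / math.sqrt(s2_tilde * v), df_prior + df


# Made-up inputs; the hyperparameter values are purely illustrative.
print(moderated_t(beta1=3.0, s2=4.0 / 3.0, df=3, v=5.0 / 6.0, s2_prior=1.0, df_prior=4))
```

With `df_prior` set to zero the squeezed variance reduces to the ordinary residual variance, so the moderated statistic coincides with the ordinary t statistic.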

Once p-values are obtained from the moderated t statistics, they are adjusted for multiple testing 
using the method of Benjamini and Hochberg (1995). 
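
The Benjamini and Hochberg adjustment can be sketched in a few lines of Python. This is a generic step-up implementation written for illustration, with a hypothetical function name, and is not the code used in the analysis.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is multiplied by the number of tests divided by its rank, and the running minimum ensures that adjusted p-values never decrease as the raw p-values increase.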



3 Thresholding on the log ratio of group variances 

In addition to a p-value cut-off, a cut-off can be specified on the log ratio of the estimated group
variances, defined as

$$\mathrm{LogVarRatio} = \log\left(\frac{\hat{\sigma}^2_{j1}}{\hat{\sigma}^2_{j2}}\right).$$

Specifying an absolute LogVarRatio of at least $\log(2)$ means that the variance of one group is at least
twice that of the other group. In our cancer datasets, we specified an absolute LogVarRatio cut-off of
at least $\log(5)$. The LogVarRatios are symmetric about zero, with negative values meaning that the
second group is more variable than the first, and positive values meaning that the first group is more
variable than the second.
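
The thresholding rule can be illustrated with the following Python sketch. It assumes that the estimated group variances are the usual sample variances of the M-values in each group and that the ratio is taken as group 1 over group 2, which matches the sign convention described above. The function names and the `fold` argument are hypothetical.

```python
import math
import statistics


def log_var_ratio(group1, group2):
    """Natural log of the ratio of sample variances, group 1 over group 2 (assumed)."""
    return math.log(statistics.variance(group1) / statistics.variance(group2))


def passes_cutoff(lvr, fold=2.0):
    """True if one group's variance is at least `fold` times the other's."""
    return abs(lvr) >= math.log(fold)


# Made-up M-values: group 1 has four times the variance of group 2.
lvr = log_var_ratio([0.0, 2.0, 4.0], [0.0, 1.0, 2.0])
print(lvr, passes_cutoff(lvr, fold=2.0), passes_cutoff(lvr, fold=5.0))
```

Swapping the two groups changes only the sign of the log ratio, so the same absolute cut-off applies in both directions.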